Machine Learning Models for Classification of Lung Cancer and Selection of Genomic Markers Using Array Gene Expression Data
Authors
Abstract
This research explores machine learning methods for the development of computer models that use gene expression data to distinguish between tumor and non-tumor cells, between metastatic and non-metastatic tumors, and between histological subtypes of lung cancer. A second goal is to identify small sets of gene predictors and to study their properties in terms of stability, size, and relation to lung cancer. We apply four classifier algorithms and two gene selection algorithms to a 12,600-gene oligonucleotide array dataset from 203 patients and normal human subjects. The resulting models exhibit excellent classification performance. Gene selection methods drastically reduce the number of genes necessary for classification. The selected genes differ substantially among gene selection methods, however. A statistical method for characterizing the causal relevance of selected genes is introduced and applied.

Introduction and Problem Statement

Lung cancer is the third most common cancer in the United States, yet it causes more deaths than breast, colon, and prostate cancer combined (Parker et al. 1996). In spite of recent advances in treatment, approximately 90% of the estimated 170,000 patients diagnosed with lung cancer in 2002 are expected to eventually die of their disease. Major goals of lung cancer research are to understand the molecular basis of the disease, to offer patients better early diagnostic and therapeutic tools, and to individualize therapeutics based on molecular determinants of the tumors. The present research addresses three aims related to creating clinically and biologically useful molecular models of lung cancer using gene expression data: (a) Apply supervised classification methods to construct computational models that distinguish between (i) cancerous vs. normal cells; (ii) metastatic vs. non-metastatic cells; and (iii) adenocarcinomas vs. squamous carcinomas.
(b) Apply feature selection methods to reduce the number of gene markers so that small sets of genes can distinguish among the different states (and, ideally, reveal genes important in the pathophysiology of lung cancer). (c) Compare the performance of the machine learning (classifier and feature selection) methods employed on this dataset and these tasks.

Data and Methods

Data. We analyzed the data of Bhattacharjee et al., a set of 12,600 gene expression measurements (Affymetrix oligonucleotide arrays) per subject from 203 patients and normal subjects. The original study explored the identification of new molecular subtypes and their association with survival; hence the experiments presented here do not replicate or overlap with those of (Bhattacharjee et al. 2001).

Classifiers. In our experiments we used linear and polynomial-kernel Support Vector Machines (LSVM and PSVM, respectively) (Scholkopf et al. 1999), K-Nearest Neighbors (KNN) (Duda et al. 2001), and feed-forward Neural Networks (NNs) (Hagan et al. 1996). For SVMs we used the LibSVM base implementation (Chang et al.), with C chosen from the set {1e-14, 1e-3, 0.1, 1, 10, 100, 1000} and degree from the set {2, 3, 4}. For KNN, we chose k from the range [1, ..., number_of_variables] using our own implementation of the algorithm. For NNs we used the Matlab Neural Network Toolbox (Demuth et al. 2001) with one hidden layer, the number of units chosen (heuristically) from the set {2, 3, 5, 8, 10, 30, 50}, variable-learning-rate back-propagation, a performance goal of 1e-8 (i.e., an arbitrary value very close to zero), a fixed momentum of 0.001, and the number of epochs chosen from the range [100, ..., 10000]. The number of epochs in particular was optimised via special scripts with nested cross-validation during training, so that training stopped when the error on an independent validation set started increasing.
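The parameter grids above can be restated compactly as data. The sketch below is illustrative only (the names GRIDS and configurations are ours, not the authors'); it shows how the candidate configurations that the inner cross-validation loop searches over can be enumerated.

```python
from itertools import product

# Hypothetical restatement of the parameter grids described in the text.
GRIDS = {
    "LSVM": {"C": [1e-14, 1e-3, 0.1, 1, 10, 100, 1000]},
    "PSVM": {"C": [1e-14, 1e-3, 0.1, 1, 10, 100, 1000],
             "degree": [2, 3, 4]},
    "NN":   {"hidden_units": [2, 3, 5, 8, 10, 30, 50]},
}

def configurations(grid):
    """Enumerate every parameter combination in a grid as a dict."""
    names = sorted(grid)
    for values in product(*(grid[n] for n in names)):
        yield dict(zip(names, values))
```

For PSVM this yields 7 x 3 = 21 candidate configurations, each of which would be scored by the inner cross-validation layer described next.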
To avoid overfitting, either in the sense of optimising parameters for classifiers or in the sense of estimating the final performance of the best classifier/gene set found (Duda et al. 2001), a nested cross-validation design was employed. In this design, the outer layer of cross-validation estimates the performance of the optimised classifiers, while the inner layer chooses the best parameter configuration for each classifier. For two of the tasks (adenocarcinoma vs. squamous, and normal vs. cancer) we used 5-fold cross-validation, while for the metastatic vs. non-metastatic task we used 7-fold cross-validation (since we had only 7 metastatic cases in the sample). To ensure optimal use of the available sample, we required that data splits be balanced (i.e., instances of the rarer of the two categories of each target appear in the same proportion in each random data split).

FLAIRS 2003. Copyright © 2003, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Feature Selection. The feature (or variable) selection problem can be stated as follows: given a set of predictors ("features") V and a target variable T, find a minimum subset F of V that achieves maximum classification performance for T (relative to a dataset, task, and set of classifier-inducing algorithms). Feature selection is pursued for a number of reasons: for many practical classifiers it may improve performance; a classification algorithm may not scale up to the size of the full feature set, in either sample-size or running-time requirements; feature selection may allow researchers to better understand the domain; it may be cheaper to collect a reduced set of predictors; and, finally, it may be safer to do so (Tsamardinos and Aliferis 2003). Feature selection methods are typically of the wrapper or the filter variety.
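The balanced (stratified) splits and the nested cross-validation design described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' code: stratified_folds, nested_cv, and the fit_score callback (train a classifier with parameter p, return its score on the held-out indices) are our illustrative names.

```python
import random
from statistics import mean

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds, keeping class proportions
    roughly equal in every fold (the 'balanced splits' in the text)."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)  # deal class members round-robin
    return folds

def nested_cv(labels, k, params, fit_score):
    """Outer loop: unbiased performance estimate of the tuned classifier.
    Inner loop: pick the best parameter using only the remaining folds."""
    folds = stratified_folds(labels, k)
    outer_scores = []
    for t in range(k):
        test = folds[t]
        inner = [f for j, f in enumerate(folds) if j != t]

        def inner_score(p):
            # average score of parameter p over the inner folds,
            # never touching the outer test fold
            return mean(
                fit_score(sum((f for j, f in enumerate(inner) if j != v), []),
                          inner[v], p)
                for v in range(len(inner)))

        best_p = max(params, key=inner_score)
        train = sum(inner, [])
        outer_scores.append(fit_score(train, test, best_p))
    return mean(outer_scores)
```

With 7 metastatic cases and 7 folds, stratification places exactly one metastatic case in each fold, matching the 7-fold choice reported above.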
Wrapper algorithms perform a heuristic search in the space of all possible feature subsets and evaluate each visited state by applying the classifier for which they intend to optimise the feature subset. Common examples of heuristic search are hill climbing (forward, backward, and forward-backward), simulated annealing, and genetic algorithms. The second class of feature selection algorithms is filtering. Filter approaches select features on the basis of statistical properties of their joint distribution with the target variable. We used two such methods:

(a) Recursive Feature Elimination (RFE). RFE builds on SVM classification. The basic procedure can be summarized as follows (Guyon et al. 2002):
1. Build a linear Support Vector Machine classifier using all V features.
2. Compute the weights of all features and choose the first |V|/2 features (sorted by weight in decreasing order).
3. Repeat steps 1 and 2 until one feature is left.
4. Choose the feature subset that gives the best performance.
5. Optional: give the best feature set to other classifiers of choice.
RFE was run with the parameters employed in (Guyon et al. 2002).

(b) Univariate Association Filtering (UAF). UAF examines the association of each individual predictor feature (gene) with the target variable. The procedure is common in applied classical statistics (Tabachnick et al. 1989) and can be summarized as follows:
1. Order all predictors according to the strength of their pair-wise (i.e., univariate) association with the target.
2. Choose the first k predictors and feed them to the classifier.
We note that various measures of association may be used. In our experiments we use Fisher Criterion Scoring, since previous research has shown that this is an appropriate measure for gene expression data (Furey et al. 2000). In practice k is often chosen arbitrarily, based on the limitations of some classifier relative to the available distribution and sample, or it can be optimised via cross-validation (our chosen approach).
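The two selection procedures can be sketched as follows. This is an illustrative skeleton under simplifying assumptions, not the authors' implementation: fisher_score uses the common two-class Fisher criterion (difference of class means squared over the sum of class variances), and in rfe the weight_fn and score_fn callbacks stand in for the linear-SVM feature weights and the cross-validated performance of a subset.

```python
def fisher_score(values, labels):
    """Fisher criterion for one gene: (m1 - m0)^2 / (s1 + s0), from the
    class-conditional means and variances. Assumes both classes present."""
    g1 = [v for v, y in zip(values, labels) if y == 1]
    g0 = [v for v, y in zip(values, labels) if y == 0]
    m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
    s1 = sum((v - m1) ** 2 for v in g1) / len(g1)
    s0 = sum((v - m0) ** 2 for v in g0) / len(g0)
    return (m1 - m0) ** 2 / (s1 + s0 + 1e-12)  # guard against zero variance

def uaf_select(X, y, k):
    """Univariate Association Filtering: rank genes by Fisher score and
    keep the k strongest. X is a samples x genes matrix (list of rows)."""
    n_genes = len(X[0])
    scores = [fisher_score([row[g] for row in X], y) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]

def rfe(features, weight_fn, score_fn):
    """RFE skeleton: repeatedly keep the better-weighted half of the
    surviving features, remembering the best-scoring subset seen."""
    best, best_score = list(features), score_fn(features)
    while len(features) > 1:
        w = weight_fn(features)       # e.g. linear-SVM weights per feature
        ranked = sorted(features, key=lambda f: w[f], reverse=True)
        features = ranked[:max(1, len(features) // 2)]
        s = score_fn(features)        # e.g. cross-validated performance
        if s > best_score:
            best, best_score = features, s
    return best
```

On a toy matrix where only gene 0 separates the classes, uaf_select recovers gene 0 first; rfe halves 12,600 features down to 1 in about 14 iterations, which is what makes it tractable at microarray scale.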
We used our own implementations of RFE and UAF.

Performance Evaluation. In all reported experiments we used the area under the Receiver Operating Characteristic (ROC) curve (AUC) to evaluate the quality of the produced models (Provost, Fawcett and Kohavi 1998). Unlike accuracy (i.e., the proportion of correct classifications), this metric is independent of the distribution of classes. It is also independent of the misclassification cost function; since in the lung cancer domain such cost functions are not generally agreed upon, we chose to use the AUC metric. We note that by emphasizing robustness, AUC also better captures the intrinsic quality of what has been learned (or is learnable) in the domain and in that sense can be considered more useful for biomedical discovery. We used our own Matlab implementation of AUC computed by the trapezoidal rule (DeLong et al. 1998). Statistical comparisons among AUCs were performed using a paired Wilcoxon rank sum test (Pagano et al. 2000).
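For reference, the trapezoidal-rule AUC computation described above can be written as follows. This is a generic sketch of the standard calculation (the original was in Matlab), assuming binary 0/1 labels with both classes present; tied scores are processed together so the ROC point is updated once per distinct threshold.

```python
def auc_trapezoid(scores, labels):
    """Area under the ROC curve by the trapezoidal rule: sweep the
    decision threshold over the sorted scores and accumulate the area
    between successive (FPR, TPR) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), reverse=True)  # high scores first
    tp = fp = 0
    tpr_prev = fpr_prev = 0.0
    area = 0.0
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == s:  # absorb tied scores
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / pos, fp / neg
        area += (fpr - fpr_prev) * (tpr + tpr_prev) / 2  # trapezoid slice
        tpr_prev, fpr_prev = tpr, fpr
    return area
```

A perfectly separating ranking yields 1.0; a half-interleaved ranking such as scores (0.9, 0.8, 0.7, 0.6) with labels (1, 0, 1, 0) yields 0.75.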